Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life.” Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects such as films, music, stage shows, comics, journalism, video games, technology, and food-related projects. (Source: https://en.wikipedia.org/wiki/Kickstarter)
I’ve been using Kickstarter for a while and I find it a good source of inspiration and a good place to purchase awesome stuff, especially when it comes to products.
This dataset was obtained from Kaggle (https://www.kaggle.com/kemical/kickstarter-projects/data). It contains around 378,000 projects with more than 13 variables. The columns are mostly self-explanatory, though you may need to visit the platform to fully understand some of them.
Let’s start with a summary of the data and its data types. This gives us some perspective when asking questions and shows whether we need to make any adjustments before plotting.
## ID name
## Min. :5.971e+03 New EP/Music Development: 41
## 1st Qu.:5.383e+08 Canceled (Canceled) : 13
## Median :1.075e+09 Music Video : 11
## Mean :1.075e+09 N/A (Canceled) : 11
## 3rd Qu.:1.610e+09 Cancelled (Canceled) : 10
## Max. :2.147e+09 Debut Album : 10
## (Other) :378565
## category main_category currency
## Product Design: 22314 Film & Video: 63585 USD :295365
## Documentary : 16139 Music : 51918 GBP : 34132
## Music : 15727 Publishing : 39874 EUR : 17405
## Tabletop Games: 14180 Games : 35231 CAD : 14962
## Shorts : 12357 Technology : 32569 AUD : 7950
## Video Games : 11830 Design : 30070 SEK : 1788
## (Other) :286114 (Other) :125414 (Other): 7059
## deadline goal launched
## 2014-08-08: 705 Min. : 0 1970-01-01 01:00:00: 7
## 2014-08-10: 558 1st Qu.: 2000 2009-09-15 05:56:28: 2
## 2014-08-07: 541 Median : 5200 2010-06-30 17:29:43: 2
## 2015-05-01: 489 Mean : 49081 2011-02-08 04:29:48: 2
## 2014-08-09: 477 3rd Qu.: 16000 2011-02-25 09:58:36: 2
## 2015-07-01: 449 Max. :100000000 2011-03-03 17:55:38: 2
## (Other) :375442 (Other) :378644
## pledged state backers
## Min. : 0 canceled : 38779 Min. : 0.0
## 1st Qu.: 30 failed :197719 1st Qu.: 2.0
## Median : 620 live : 2799 Median : 12.0
## Mean : 9683 successful:133956 Mean : 105.6
## 3rd Qu.: 4076 suspended : 1846 3rd Qu.: 56.0
## Max. :20338986 undefined : 3562 Max. :219382.0
##
## country usd_pledged p_funded
## US :292627 Min. : 0 Min. : 0
## GB : 33672 1st Qu.: 31 1st Qu.: 0
## CA : 14756 Median : 624 Median : 13
## AU : 7839 Mean : 9059 Mean : 324
## DE : 4171 3rd Qu.: 4050 3rd Qu.: 107
## N,0" : 3797 Max. :20338986 Max. :10427789
## (Other): 21799
## 'data.frame': 378661 obs. of 14 variables:
## $ ID : int 1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
## $ name : Factor w/ 375767 levels ""," IT\x92S A HOT CAPPUCCINO NIGHT ",..: 332540 135688 364966 344807 77347 206129 293463 69359 284138 290720 ...
## $ category : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
## $ main_category: Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
## $ currency : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
## $ deadline : Factor w/ 3164 levels "2009-05-03","2009-05-16",..: 2288 3042 1333 1017 2247 2463 1996 2448 1790 1863 ...
## $ goal : num 1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
## $ launched : Factor w/ 378089 levels "1970-01-01 01:00:00",..: 243292 361975 80409 46557 235943 278600 187500 274014 139367 153766 ...
## $ pledged : num 0 2421 220 1 1283 ...
## $ state : Factor w/ 6 levels "canceled","failed",..: 2 2 2 2 1 4 4 2 1 1 ...
## $ backers : int 0 15 3 1 14 224 16 40 58 43 ...
## $ country : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
## $ usd_pledged : num 0 2421 220 1 1283 ...
## $ p_funded : num 0 8.07 0.489 0.02 6.579 ...
It looks like most of my variables have sensible data types. The launched and deadline columns are stored as factors and would need to be parsed as dates, but I am not planning on using them for now. So, let’s get started!
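For reference, here is a minimal sketch of how the two date columns could be parsed if they are needed later. The two-row `ks_dates` data frame is a hypothetical stand-in for `ks`, and treating the 1970-01-01 launch dates as missing is my assumption (they look like placeholder values), not something the dataset documents.

```r
# Hypothetical two-row stand-in for ks; the real columns are factors,
# so they go through as.character() before parsing.
ks_dates <- data.frame(
  launched = factor(c("2015-08-11 12:12:28", "1970-01-01 01:00:00")),
  deadline = factor(c("2015-10-09", "2010-09-15"))
)

launched_at <- as.POSIXct(as.character(ks_dates$launched),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
deadline_on <- as.Date(as.character(ks_dates$deadline))

# Assumption: launch dates before 2000 are placeholders, so mark them as missing
launched_at[launched_at < as.POSIXct("2000-01-01", tz = "UTC")] <- NA

# Campaign length in days (NA where the launch date is missing)
duration_days <- as.numeric(deadline_on - as.Date(launched_at))
duration_days
```

A campaign-length column like this could later be plotted against the project state, but I have not done that here.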
ggplot(aes(x=state), data=ks) +
geom_bar()
This plot shows the project counts per state. Funding on Kickstarter is all-or-nothing. As we can see here, a large portion of the projects fail.
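To put numbers on the bar chart, the counts from the summary above can be turned into percentages. This is just a quick sketch using base R; the counts are copied by hand from the `summary()` output rather than recomputed from `ks`.

```r
# Counts copied from the summary() output above
state_counts <- c(canceled = 38779, failed = 197719, live = 2799,
                  successful = 133956, suspended = 1846, undefined = 3562)

# Share of all projects in each state, in percent
state_share <- round(100 * prop.table(state_counts), 1)
sort(state_share, decreasing = TRUE)
```

Roughly 52% of the projects failed and about 35% were successful.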
Next, let’s have a look at project submissions per country.
ggplot(aes(x=country), data=ks) +
geom_bar()
As expected, the US is leading.
Next, we can see the main project categories. In this dataset, we have a main category and a sub-category.
ggplot(aes(x=main_category), data=ks) +
geom_bar() +
theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=10, face="bold"), axis.text.x=element_text(size=10, angle=90))
As can be seen in the graph, Film & Video got the highest number of projects submitted.
Next, we will have a look at the most used currencies.
ggplot(aes(x=currency), data=ks) +
geom_bar()
Again, USD is the most used currency. I wonder whether you can specify the currency for your project or if it’s enforced by your country or bank account.
Next, I wanted to get an idea of the funding percentage to see how far projects fall short of or exceed their goals.
ks %>% group_by(state) %>%
summarize(count=n(), mean=mean(p_funded)) %>%
arrange(desc(count))
## # A tibble: 6 x 3
## state count mean
## <fct> <int> <dbl>
## 1 failed 197719 9.06
## 2 successful 133956 856
## 3 canceled 38779 124
## 4 undefined 3562 57.4
## 5 live 2799 289
## 6 suspended 1846 155
Impressive! Successful projects get on average 856% of their goals! In contrast, failed projects average less than 10% of their goal, which makes sense.
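One caveat: the summary above shows a maximum p_funded of more than 10 million percent, so a handful of extreme projects can pull the mean up a lot. The sketch below uses a small made-up data frame (`toy`, not the real data) to show how the median can tell a different story from the mean.

```r
# Made-up values for illustration only; one extreme outlier among the successes
toy <- data.frame(
  state    = c("successful", "successful", "successful",
               "failed", "failed", "failed"),
  p_funded = c(105, 120, 50000, 0, 5, 40)
)

means   <- tapply(toy$p_funded, toy$state, mean)
medians <- tapply(toy$p_funded, toy$state, median)
cbind(mean = means, median = medians)
```

On the real data, the same idea should work by adding `median=median(p_funded)` to the `summarize()` call above.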
The dataset has 378,661 rows and 13 original variables (14 including the p_funded column I added), stored as factors, integers, and numerics. In general, I would say it’s clean.
Pledges, goals, and backers are all features that will guide me and help explain my feature(s) of interest.
However, there are other features that I wish the dataset had:

- Ages of the people who back these projects
- When and for how long projects get backed (the first hour? the last hour of the project’s life?)
- Number of updates per project
- Number of comments per project

### Did you create any new variables from existing variables in the dataset?

Yes, I created a variable called p_funded, which is basically the percentage of a project’s goal that gets funded.
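The code that created p_funded is not shown in this post. A plausible reconstruction, assuming it is simply pledged divided by goal times 100, is sketched below. The first five goal and pledged values are copied from the `str()` output above, and the result lines up with the p_funded values printed there.

```r
# First five goal/pledged pairs, copied from the str() output above
goal    <- c(1000, 30000, 45000, 5000, 19500)
pledged <- c(0, 2421, 220, 1, 1283)

# Assumed definition: amount pledged as a percentage of the goal
p_funded <- pledged / goal * 100
round(p_funded, 3)
```

On the full data frame this would be something like `ks$p_funded <- ks$pledged / ks$goal * 100`. Again, that is my assumption about the definition.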
Yes, some of the distributions are unusual. First, I will display the ones that look roughly normal once they are log-scaled, followed by the ones that are skewed to the right. The following plots look roughly normal after log scaling:
# Force R to not use exponential notation
options(scipen=999)
ggplot(ks, aes(x=goal)) + geom_histogram()
ggplot(ks, aes(x=usd_pledged)) + geom_histogram()
ggplot(ks, aes(x=pledged)) + geom_histogram()
We cannot see the distributions without scaling the plots. So, let’s scale them!
ggplot(ks, aes(x=goal)) + geom_histogram() + scale_x_log10()
ggplot(ks, aes(x=usd_pledged)) + geom_histogram() + scale_x_log10()
ggplot(ks, aes(x=pledged)) + geom_histogram() + scale_x_log10()
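One thing to keep in mind with log scales: the minimum of goal, pledged, and usd_pledged is 0 in the summary above, and log10(0) is -Inf, so those rows cannot be drawn on a log axis. The sketch below uses a few hypothetical pledged amounts to show the effect and one common workaround (adding 1 before taking the log).

```r
# Hypothetical pledged amounts, including projects that raised nothing
pledged <- c(0, 0, 1, 30, 620, 4076, 20338986)

sum(pledged == 0)      # number of rows a log10 axis cannot show
log10(pledged)         # the zeros map to -Inf
log10(pledged + 1)     # the shifted version keeps every row finite
```

Whether the shift is appropriate depends on the question. For a quick look at the overall shape, it is usually good enough.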
The following one is a little skewed to the right and has a weird gap in the middle.
ggplot(ks, aes(x=backers)) + geom_histogram()
It’s not clear without log scaling. Let’s log scale it!
ggplot(ks, aes(x=backers)) + scale_x_log10() + geom_histogram()
Here we go. I find the shape weird and this data requires further investigation.
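One possible explanation for the odd shape (my guess, not something I verified on the plot itself) is that backers is an integer count. On a log axis, small counts such as 1 and 2 sit further apart than one histogram bin, so some bins between them stay empty. The sketch below compares those distances with an approximate bin width, assuming the default of 30 bins and the maximum of 219,382 backers from the summary.

```r
max_backers <- 219382                     # maximum from the summary above
bin_width   <- log10(max_backers) / 30    # rough width with the default 30 bins

# Distance between consecutive small counts on the log10 axis
gaps <- diff(log10(1:6))
data.frame(from = 1:5, to = 2:6,
           gap = round(gaps, 3),
           wider_than_bin = gaps > bin_width)
```

Only the step from 1 to 2 backers is wider than a bin here, which would leave a hole near the left edge of the histogram. The exact bin edges ggplot2 picks differ slightly from this approximation.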
I did not need to tidy or adjust the data, since it was already clean. The only change I made was creating an additional variable.
In the next plot, we explore the relationship between the goal of the project and the amount that was pledged. Note that the code below uses the pledged column rather than usd_pledged.
ggplot(aes(x = pledged, y = goal),
data = ks) + geom_point()
Since there is a lot of data depicted in the above graph, two things need to be done to make the plot readable:

1. Apply a log transformation so that the patterns are more clearly visible.
2. Format the x-axis and y-axis labels with commas so that R does not display exponential notation.
ggplot(aes(x = pledged, y = goal),
data = ks) + geom_point() +
scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)
Since we have a large dataset with over-plotting, the alpha aesthetic will make the points more transparent. Let’s try that.
ggplot(aes(x = pledged, y = goal),
data = ks) + geom_point(alpha = 1/10) +
scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)
Here we go. This is a very interesting plot indeed! In the next section, we are going to classify those projects based on their state. Next, let’s explore if there’s a relationship between the usd_pledged and the number of project backers.
ggplot(aes(x = (usd_pledged), y = (backers) ),data = ks) + geom_point()
This is yet another interesting plot. I think we have a positive relationship here. Let’s now limit the plot to see the relationship more clearly.
ggplot(aes(x = (usd_pledged), y = (backers) ),data = ks) + geom_point(alpha = 1/5) +
xlim(0,10000000) + ylim(0,100000)
There you go. It looks to me like it’s a positive, linear relationship. Let’s check that by computing Pearson’s correlation coefficient.
cor(ks$backers, ks$usd_pledged)
## [1] 0.7525394
As expected, a coefficient of about 0.75 indicates a fairly strong positive relationship. Next, we are going to explore the Amount Pledged vs. Project Category.
ggplot(ks, aes(main_category, usd_pledged)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90))
Interesting! This plot shows that our data has a lot of outliers. In the final section, we will clean the data and present it in a way that is easier to understand, followed by a five-number summary.
I did not find anything unexpected.
The strongest relationship I found was between backers and usd_pledged.
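Strictly speaking, `cor()` only returns the coefficient. If a significance test is wanted, base R has `cor.test()`. Because both variables are heavily right-skewed, a rank-based (Spearman) coefficient may be a useful cross-check. The sketch below runs both on synthetic data, since the vectors are made-up stand-ins for `ks$backers` and `ks$usd_pledged`, not the real columns.

```r
set.seed(42)
# Synthetic, right-skewed stand-ins for ks$backers and ks$usd_pledged (assumption)
backers     <- round(rlnorm(500, meanlog = 3, sdlog = 1.5))
usd_pledged <- backers * rlnorm(500, meanlog = 4, sdlog = 0.5)

pearson  <- cor.test(backers, usd_pledged, method = "pearson")
# exact = FALSE avoids the ties warning for the rank-based test
spearman <- cor.test(backers, usd_pledged, method = "spearman", exact = FALSE)

c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate))
pearson$p.value
```

On the real data, the same two calls with `ks$backers` and `ks$usd_pledged` should work as long as there are no missing values in either column.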
I am starting this section with a plot of Goal vs Pledged, colored by the state of the project.
ggplot(aes(x = usd_pledged, y = goal, color=state),
data = ks) + geom_point()
Okay, it’s nice. But it can be made more readable if a log transformation is done. Let’s try that.
ggplot(aes(x = usd_pledged, y = goal, color=state),
data = ks) + geom_point() +
scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)
What if we color-code it based on currency?
ggplot(aes(x = usd_pledged, y = backers, color = currency ),data = ks) + geom_point()
Interesting. Could we see digital currencies soon? Let’s add the main category to the picture.
ggplot(aes(x = usd_pledged, y = goal, color = main_category ),data = ks) + geom_point()
Interesting. Now, let’s scale it.
ggplot(aes(x = usd_pledged, y = goal, color = main_category ),data = ks) + geom_point() + scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)
Here you go. Looks like technology all over the place! I think it’s not an easy plot to interpret.
Here, we can see that most of the successful projects are closer to the x-axis, meaning they have lower goals, which is what this platform is perfect for.
Yes, this was the case with backers vs. usd_pledged. When you color-code the data by currency, you see USD projects start to take over, which is expected.
ggplot(aes(x=main_category), data=ks) +
geom_bar() +
theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=10, face="bold"), axis.text.x=element_text(size=10, angle=90)) +
ggtitle("Number Of Projects Per Category") + xlab("Project category") + ylab("Number of projects")
Here, we see the number of projects published per category. My initial expectation was that most of the projects would fall under the Technology category, but surprisingly, Film & Video leads all categories, followed by Music. In the next plot, we are going to see how that compares with the amount pledged in each category.
# inspired by this kernel from Kaggle https://www.kaggle.com/andrewjmah/kickstarter-exploratory-data-analysis-with-r
ggplot(ks, aes(main_category, usd_pledged)) + geom_boxplot() +
ggtitle("Amount Pledged vs. Project Category") + xlab("Project Category") +
ylab("Amount Pledged (USD)") +
theme(axis.text.x=element_text(angle=90)) +
scale_y_log10(labels=scales::comma)
First of all, looking at the median, Q1, and Q3 of each category in this plot gives you a clear idea of the amount people usually pledge per category. Generally, the categories are not that far from one another (in terms of the median, max, etc.). Second of all, Design, Games, and Technology are the categories that attract the highest pledges. This is very interesting since we saw in the previous plot how “Film & Video” was leading in submissions per category, but the money says something else :)
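The numbers behind a boxplot can also be listed directly. Here is a minimal sketch of a five-number summary per category using base R’s `fivenum()`. The `toy` data frame is hypothetical and only shows the shape of the result.

```r
# Hypothetical rows standing in for ks
toy <- data.frame(
  main_category = c("Design", "Design", "Design", "Design", "Design",
                    "Music", "Music", "Music", "Music", "Music"),
  usd_pledged   = c(10, 500, 2000, 9000, 150000, 0, 50, 800, 3000, 12000)
)

# fivenum() returns min, lower hinge, median, upper hinge, max
five <- t(sapply(split(toy$usd_pledged, toy$main_category), fivenum))
colnames(five) <- c("min", "q1", "median", "q3", "max")
five
```

Swapping `toy` for `ks` should give one row per main category.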
ggplot(aes(y = goal, x = usd_pledged, color = state ),data = ks) + geom_point() + scale_x_log10(labels=scales::comma)+scale_y_log10(labels=scales::comma) + xlab("Amount Pledged (USD)") + ylab("Project Goal (USD)") +
ggtitle("Amount Pledged vs. Project Goal")
There are a couple of things I find interesting in this graph. Firstly, it shows right away where successful projects end up in terms of funding relative to their goals. Secondly, it helps explain why you see a lot of failed projects high up on the y-axis: one of the reasons for failure is clearly a high goal (some could be scams as well), while the amount pledged on the x-axis stays around the average. Also, the more money you aim for, the less likely you are to get successfully funded. It also suggests that Kickstarter may not be the right place to fund that kind of project.
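The claim that higher goals succeed less often could be checked with a simple table of success rates per goal bracket. The sketch below shows one way to do that. The `toy` rows and the bracket boundaries are made up for illustration, so its output says nothing about the real data.

```r
# Hypothetical rows standing in for ks
toy <- data.frame(
  goal  = c(500, 800, 3000, 7000, 20000, 60000, 250000, 900000),
  state = c("successful", "successful", "failed", "successful",
            "failed", "successful", "failed", "failed")
)

# Bucket goals by order of magnitude (boundaries are an arbitrary choice)
toy$goal_bucket <- cut(toy$goal,
                       breaks = c(0, 1e3, 1e4, 1e5, Inf),
                       labels = c("<=1k", "1k-10k", "10k-100k", ">100k"))

# Share of successful projects in each bucket, in percent
success_rate <- tapply(toy$state == "successful", toy$goal_bucket, mean)
round(100 * success_rate)
```

Running the same steps on `ks` would show whether the success rate really drops as the goal grows.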
In my opinion, the most important thing, and what I personally find most effective, is working on something you care about and have a genuine interest in. When I scanned the Udacity datasets that were recommended to me, I had no interest in any of the subjects that were presented. This is where, I guess, domain knowledge and interest come in handy when asking questions.

I started the dataset search and first chose a dataset from Kiva, which is a crowdsourcing platform for loans. It was painful to download, understand, and clean the data. I spent a couple of days cleaning it, and in the end I didn’t have enough variables to meet the dataset requirement. After this project, I started to appreciate tidy datasets.

I also didn’t like R, probably because I am more familiar with Python. But I liked the fact that I was exposed to it and was able to interact with its community. Finally, I think we can build models with this dataset to predict project success. That would be my next goal after exploring this dataset. This was fun.